Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kraken2] Split database from Kraken2 TheiaCoV task #608

Closed
wants to merge 56 commits into from

Conversation

jrotieno
Copy link
Contributor

@jrotieno jrotieno commented Sep 6, 2024

This PR closes #563, closes #520 and closes #402

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

This PR addresses several issues that came up when the RSV target organism was set differently for RSV-A and RSV-B, i.e. rsv_a_kraken_target_organism = "Respiratory syncytial virus" and rsv_b_kraken_target_organism = "Human orthopneumovirus" due to the use of an old database. As the existing docker image had an embedded database, it meant that each time the Kraken database would need to be updated, it would have to be done inside the docker container, which would not be efficient. The TheiaProk workflows were already using a standalone database and a docker image free of a database, allowing for flexibility in user input for a preferred database.

We have therefore implemented the same strategy to the TheiaCoV workflows, using the standalone Kraken2 task, and a default custom built Kraken2 database comprising Human and Viral sequences.

Further, for consistency, we have made all Kraken 2 outputs to be prefixed by "Kraken2", which previously had a mix of Kraken2_ and Kraken_

⚡ Impacted Workflows/Tasks

This PR may lead to different results in pre-existing outputs: Yes
Use of an updated database may reflect changes in how taxa are identified.

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

⚙️ Algorithm

We have removed the kraken2_theiacov task and instead using the kraken2_standalone task for the TheiaCoV workflows

➡️ Inputs

The TheiaCoV workflows, Illumina PE, Illumina SE, and ONT, have a default Kraken2 database that includes Human and Viral sequences from NCBI
File kraken_db = "gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz"

⬅️ Outputs

All Kraken2 outputs are now prefixed with "Kraken2_"
Float? kraken2_human
Float? kraken2_human_dehosted
Float? kraken2_sc2
Float? kraken2_sc2_dehosted
String? kraken2_target_organism
String? kraken2_target_organism_dehosted
String? kraken2_target_organism_name
String? kraken2_version
File? kraken2_report
File? kraken2_report_dehosted
String? kraken2_docker
String? kraken2_database

🧪 Testing

The database was tested locally with the standalone task and the custom database and worked as expected.

Further testing was successfully done on Terra as follows:
Bacterial pathogens and TheiaProk workflows: call_kraken=true and kraken_db=gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz
TheiaProk_Illumina_PE_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/10eddd67-5216-4529-b355-8a14947de621
TheiaProk_ONT_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/7206c5d6-fd58-448c-becf-1abb4ed27c5c

Viral Pathogens and TheiaCoV workflows
TheiaCoV_Illumina_PE_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/7212fba6-b533-4afc-882f-a0aebc5a9157
TheiaCoV_Illumina_SE_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/03a0096f-7636-47fe-be06-3446596f7daf
TheiaCoV_ONT_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/11f05bb7-022c-4d44-914b-992614202e46

RSV and TheiaCoV_Illumina_PE_PHB
RSV-A: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/8f315934-0771-40d9-914a-19abc7a9413d
best target for identification is "Human respiratory syncytial virus A"

  0.18	2637	2637	U	0	unclassified
 99.82	1499711	34	R	1	root
 99.78	1499055	11	D	10239	  Viruses
 99.77	1498888	0	D1	2559587	    Riboviria
 99.77	1498888	0	K	2732396	      Orthornavirae
 99.77	1498888	0	P	2497569	        Negarnaviricota
 99.77	1498888	0	P1	2497570	          Haploviricotina
 99.77	1498888	0	C	2497574	            Monjiviricetes
 99.77	1498888	0	O	11157	              Mononegavirales
 99.77	1498888	0	F	11244	                Pneumoviridae
 99.77	1498888	0	G	1868215	                  Orthopneumovirus
 99.77	1498887	0	S	3049954	                    Orthopneumovirus hominis
 99.77	1498887	967	S1	11250	                      human respiratory syncytial virus
 99.71	1497920	1497920	S2	208893	                        Human respiratory syncytial virus A

RSV-B: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/73211a8e-21cb-4549-b424-01a7d636877a
best target for identification is "human respiratory syncytial virus"

1.58	6378	6378	U	0	unclassified
98.42	396488	9	R	1	root
98.36	396273	6	D	10239	  Viruses
98.31	396039	0	D1	2559587	    Riboviria
98.31	396039	1	K	2732396	      Orthornavirae
98.30	396037	0	P	2497569	        Negarnaviricota
98.30	396029	0	P1	2497570	          Haploviricotina
98.30	396029	0	C	2497574	            Monjiviricetes
98.30	396029	0	O	11157	              Mononegavirales
98.30	396029	1	F	11244	                Pneumoviridae
98.30	396028	2	G	1868215	                  Orthopneumovirus
98.30	396026	0	S	3049954	                    Orthopneumovirus hominis
98.30	396026	382062	S1	11250	                      human respiratory syncytial virus
 3.47	13964	13964	S2	208893	                        Human respiratory syncytial virus A

RSV and TheiaCoV_Illumina_PE_PHB with updated Kraken target organism identifier
RSV-A: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/3546ab3d-c2cd-4c1e-ae3a-da78754733e7

RSV-B: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/ce3d507f-05df-403f-812e-7f781f761364

Bacterial and Viral Pathogens and Kraken2 standalone workflows
Kraken2_PE_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/ac5dc25b-a305-4ca2-be52-3845eb683d4e
Kraken2_SE_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/fa7e5754-85de-4b34-929c-c11da23ab3d5
Kraken2_ONT_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/6845c8eb-ee26-402f-ac11-550591f7a7ac

Suggested Scenarios for Reviewer to Test

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable (Theiagen developers only)

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

@cimendes
Copy link
Member

I think this is looking pretty okay. It's a quite sizable PR as a lot of issues started to get pooled into the same PR. In short, the main changes are:

  • the kraken2 tasks just for theiacov went bye bye. Now all workflows use the kraken standalone task
  • New kraken2 viral+human database, constructed by theiagen (default viral doesn't include human which we report on)
  • task names, inputs and outputs have been harmonized
  • new logic implemented to only report % of sc2 if the organism is sc2. otherwise report % target organanism.

Additional tests that could be done that I personally didn't check but James did:

  • TheiaProk and TheiaMeta remain unaffected by these changes.

@@ -227,8 +230,8 @@ workflow theiacov_illumina_pe {
num_reads_raw2 = read_QC_trim.fastq_scan_raw2,
num_reads_clean1 = read_QC_trim.fastq_scan_clean1,
num_reads_clean2 = read_QC_trim.fastq_scan_clean2,
kraken_human = read_QC_trim.kraken_human,
kraken_human_dehosted = read_QC_trim.kraken_human_dehosted,
kraken_human = read_QC_trim.kraken2_human,
Copy link
Member

@sage-wright sage-wright Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you update the qc_check_task to use kraken2 as well for consistency? probably going to have to make this change in the export_taxon_table task as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kraken2 is kraken2 everywhere now

@@ -37,6 +37,8 @@ workflow theiacov_clearlabs {
String? target_organism
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be kraken2_target_organism? i feel like we discussed this at one point in time and cannot remember the conclusion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a conflict with the output declaration in this particular case so I'm unsure how to proceed here. I0m leaving it as target_organism until we can discuss it further

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on theiacov_ont is also target_organism

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be an instance we'd want this input to feed into multiple tasks, to ensure consistency? My gut reaction to having workflow level inputs be specific to tasks, is we should only do that if we could/would not use it in multiple workflows. This, to me, seems like an instance we might want to leave it without any tool name to allow for just that. Maybe worth some discussion? Ultimately, when we get to the PHB-lite effort, this will get abstracted away, but something to consider in the interim.

Copy link
Member

@sage-wright sage-wright Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure that's what we're using the organism_parameter workflow to do in order to standardize the language here. This input is piped into organism_parameter and that output is fed into all downstream tasks

Could you help me understand what you mean by having workflow level inputs specific to tasks if they're not used in multiple workflows?

If I'm understanding correctly, then we might be better served by removing all parameters that can be modified in the organism_parameters workflow so that they're not workflow-level inputs

Copy link
Member

@sage-wright sage-wright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks great -- one question: if our aim is to standardize the use of kraken2 vs kraken, we should probably use kraken2 everywhere and not pick and choose.

Please go through and standardize all the kraken instances to kraken2 in both docs and all workflows & tasks.

@sage-wright
Copy link
Member

sage-wright commented Oct 22, 2024

Testing TheiaProk PE here, TheiaMeta here and Kraken2_PE here

@cimendes
Copy link
Member

Other impacted workflows that require testing:

  • Freyja_FASTQ
  • NCBI_Scrub_PE
  • NCBI_Scrub_SE

@cimendes
Copy link
Member

Okay, I think everything is ready for retesting!!

  • TheiaCoV
    • PE
    • SE
    • ONT
  • TheiaMeta
  • Kraken2
    • PE
    • SE
    • ONT
  • NCBI-SRUB
    • PE
    • SE
  • Freyja_FASTQ

Copy link
Contributor

@AndrewLangvt AndrewLangvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cimendes I know this will be a pain to go back and refactor, but we need to avoid changing workflow-level output namespaces for modifications such as "harmonizing" the use of kraken/kraken2. i.e. have what is output to the Terra table remain the same, but we can change what defines that variable.
For example

Float kraken2_human_dehosted = kraken2.kraken2_percent_human

Float kraken_human_dehosted = kraken2.percent_human
# should become 
Float kraken_human_dehosted = kraken2.kraken2_percent_human
# should NOT become 
Float kraken2_human_dehosted = kraken2.kraken2_percent_human

Changing all instances of kraken, including the output namespaces, introduces a backwards incompatible change that would pose some pain points to the end user. I know some of this may work may have begun with James, but let's ditch those, where possible and only make those changes "under the hood" so our users are unimpacted. Happy to discuss more, if you like.

@cimendes cimendes marked this pull request as draft November 4, 2024 19:10
@cimendes cimendes closed this Nov 7, 2024
@sage-wright sage-wright deleted the jro-kraken-split-database-and-task branch November 7, 2024 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants